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Abstract — Controlling and analyzing cyberphysical and robotics 
systems is increasingly becoming a Big Data challenge. Pushing this 
data to, and processing in the cloud is more efficient than on-board 
processing. However, current cloud-based solutions are not suitable 
for the latency requirements of these applications. We present 
a new concept, Discretized Streams or D-Streams, that enables 
massively scalable computations on streaming data with latencies 
as short as a second. We experiment with an implementation 
of D- Streams on top of the Spark computing framework. We 
demonstrate the usefulness of this concept with a novel algorithm 
to estimate vehicular traffic in urban networks. Our online EM 
algorithm can estimate traffic on a very large city network (the 
San Francisco Bay Area) by processing tens of thousands of 
observations per second, with a latency of a few seconds. 

Note to Practitioners 

This work was driven by the need to estimate vehicular 
traffic at a large scale, in an online setting, using commodity 
hardware. Machine Learning algorithms combined with streaming 
data are not new, but it still requires deep expertise both in 
Machine Learning and in Computer Systems to achieve large scale 
computations in a tractable manner. The Streaming Spark project 
aims at providing an interface that abstracts out all the technical 
details of the computation platform (cloud, HPC, workstation, etc.). 

As shown in this work. Streaming Spark is suitable for imple- 
menting and calibrating non-trivial algorithms on a large cluster, 
and provides an intuitive yet powerful programming interface. 
The readers are invited to refer to the source code referred in this 
article for more examples. 

This article presents algorithms to sample and compute densities 
for Gamma random variables restricted to a hyperplane (i.e. 
distributions of the form Ti\^. ajTj — d with Tj independant 
Gamma distributions). It is common in this case to use Gaussian 
random variables because of closed form solutions to solve. If 
one considers positive valued distributions with heavy tails, our 
formulas using gamma distributions may be more suitable. 

Keyword: Streaming Spark, Arterial traffic estimation. 

Index Terms — Streaming, Expectation-Maximization, Large- 
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D 

L Introduction 

Cyber-physical systems involve a complex integration of 
physical and computational processes. Such systems usually 
integrate two distinct components - (i) a set of sensors that 
continuously produce streaming data (ii) a set of communication 
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and computation systems that aggregate data and perform data 
analytics. Large-scale cyber-physical systems can be found 
everywhere - from intricate control systems used in robotics 
to complex environment sensing of civil infrastructures. The 
data produced by such systems can be very large (millions of 
records/seconds) and the amount of computation necessary to 
interpret the data can be significant. 

Furthermore, over the last decade, the cost of sensing and 
communication equipment has fallen dramatically to the point 
that it is less expensive to incorporate a large number of 
sensors collecting low- value information, rather than judiciously 
deploying a limited number of sophisticated and more accurate 
measurement systems. The corresponding fall in costs shifts the 
burden from carefully designing instrumentation to correctly fil- 
tering and interpreting the wealth of information available to the 
researcher. This paradigm shift leads to two problems: storage 
and processing. It is not clear which part of the information is 
relevant so potentially all the information needs to be saved, 
which may be expensive: in order to find the proverbial needle 
in the haystack, one needs enough room for the haystack. In 
addition, small, cheap, unreliable sensors provide information 
that is more noisy and potentially requires more processing 
compared to dedicated sensors. 

Industries such as genomics and astronomy have learned to 
cope with extremely large datasets over the last decade. What 
makes cyberphysical systems truly stand out amongst these 
applications is the very fast decay of the value of information: in 
robotics systems for example, the data collected is usually fed 
into a control system. Past information is often of limited or 
no value, sometimes as fast as in the span of a few minutes or 
tens of seconds. This is unlike genomic records which, rather 
than being processed immediately, need to be stored reliably 
for a long time. In essence, the incoming information in cyber- 
physical systems needs to be processed as a stream, and not so 
much as an ever-growing dataset. 

The problem of processing large incoming streams of data 
has received little attention so far, because most work has fo- 
cused on adapting existing technologies which are either design 
scalability or latency but not both. Streaming databases | 6 | can 
provide the necessary low latencies but are limited in scalability. 
On the other hand, scalable batch processing systems like 
MapReduce or Hadoop 1 1 1 are designed for scaling to thousands 
of machines but perform poorly in terms of latency. Latency 
is not a concern for many applications, and for these running 
regular batch jobs with traditional batch systems is appropriate. 
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But this is often insufficient for many cyber-physical systems 
such as robotics. Working at lower latencies (at the scale of 
seconds or tens of seconds) presents significant challenges. As 
a solution to this problem, we use D-Streams |33|, a recently 
proposed programming model where streaming computations 
are decomposed into a series of batch computations on small 
time intervals. This model provides two significant benefits. 

• Scalability with low latency: Stream processing appli- 
cations, implemented using D- Streams to scale to large 
clusters (hundreds of cores) while providing latencies as 
low as hundreds of milliseconds. 

• Simple high-level programming API: The D-Stream ab- 
straction and its associated operations makes it very conve- 
nient for a developer to implement complex business logic 
to process the raw sensor data. 

In this article, we investigate the use of Spark Streaming, a 
system implementing D-Streams, with a large scale estimation 
problem: inferring the state of traffic on a large road network 
by using streams of GPS readings. Traffic congestion affects 
nearly everyone in the world due to the environmental damage 
and transportation delays it causes. The 2007 Urban Mobility 
Report 1 31 1 states that traffic congestion causes 4.2 billion 
hours of extra travel in the United States every year, which 
accounts for 2.9 billion extra gallons of fuel and an additional 
cost of $78 billion. Providing drivers with accurate traffic 
information reduces the stress associated with congestion and 
allows drivers to make informed decisions, which generally 
increases the efficiency of the entire road network |9|. 

Modeling highway traffic conditions has been well- studied 
by the transportation community with work dating back to 
the pioneering work of Lighthill, Whitham and Richards p2| . 
Recently, researchers demonstrated that estimating highway 
traffic conditions can be done using only GPS probe vehicle 
data |32|. Arterial roads, which are major urban city streets that 
connect population centers within and between cities, provide 
additional challenges for traffic estimation. Recent studies fo- 
cusing on estimating real-time arterial traffic conditions have 
investigated traffic flow reconstruction for single intersections 
using dedicated traffic sensors. Dedicated traffic sensors are 
expensive to install, maintain and operate, which limits the 
number of sensors that governmental agencies can deploy on 
the road network. The lack of sensor coverage across the arterial 
network thus motivates the use of GPS probe vehicle data for 
estimating traffic conditions. 

Recent studies focusing on estimating real-time arterial traf- 
fic conditions have investigated traffic flow reconstruction for 
single intersections iTOj, pTl using dedicated traffic sensors. 
We consider an estimation engine deployed inside the Mobile 
Millennium traffic information system fT\, f20\. This engine 
gathers GPS observations from participating vehicles and pro- 
duces estimates of the travel times on the road network. Mobile 
Millennium is intended to work at the scale of large metropolitan 
areas: the road network considered in this work is a real road 
network (a large portion of the greater Bay Area, comprising 
506,685 road links) and the data for this work is collected from 
thousands of vehicles that generate millions of observations per 



day. As a consequence of these specifications and requirements, 
we employ highly scalable traffic algorithms. 

The specific problem we address in this use case is how to 
extract travel time distributions from sparse, noisy GPS mea- 
surements collected in real-time from vehicles. A probabilistic 
model of travel times on the arterial network is presented along 
with an online Expectation Maximization (EM) algorithm for 
learning the parameters of this model (Section [III]). The algo- 
rithm is expensive due to the large dimension of the network and 
the complexity inherent to the evolution of traffic. Furthermore, 
our EM algorithm has no closed-form expression and requires 
sampling and non-linear optimization techniques. This is why 
the use of a distributed system is appropriate. We will present 
D-Streams in more detail in Section [III 

The present work is novel for several reasons. First, it 
advances research in traffic estimation by presenting a novel 
travel time estimation algorithm that is highly scalable, uses 
data commonly available nowadays, and is robust to noise 
and other random perturbations. This algorithm builds upon a 
novel statistical distribution: the Gamma-Dirichlet distribution 
(formally introduced in Appendix A). Second, it shows that 
it is possible to use complex, multistage filtering algorithms 
on very large systems with a latencies under a few seconds. 
Third, it explores the tradeoffs between computational power, 
timeliness, and accuracy of estimation of the travel times outputs 
and shows that the estimates gracefully degrade with less data. 
Fourth, the workflow of this algorithm is representative of a 
large class of Machine Learning and estimation algorithms. We 
believe the good system performance results obtained for this 
particular application (Section [IV| ) hint at potentially significant 
speedups for other distributed estimation algorithms, and are of 
interest to researchers using cloud computing for large physical 
systems, and for the Machine Learning community. 

We start by introducing the D-Streams programming model 
(Section [11]) and the problem of traffic estimation (Section 
|In|. We then give an overview of our estimation algorithm 



(Section |in-B| ) and we explain how we used Spark Streaming 
to parallelize the algorithm (Section |in-D| ). We evaluate our 
implementation in Section[lV|from the perspective of scalability 



(Section IV-B) and accuracy (Section IV-C). The derivations 
related to the properties of the Gamma-Dirichlet distribution 
are included in the Appendix A. 

IL DiSCRETIZED STREAMS: LARGE-SCALE REAL-TIME 
PROCESSING OF DATA STREAMS 

Discretized stream (D-Stream) is a recently proposed pro- 
gramming model p3| for processing of streaming data that 
allows complex machine learning algorithms to be easily ex- 
pressed and executed on large streams of live data. In this sec- 
tion, we will first discuss the limitations of existing techniques 
of processing live data. Then we will elaborate on D-Streams 
highlighting its advantages over existing techniques. 

A. Limitations of current techniques 

Current techniques to process large amounts of live streaming 
data can be broadly classified into the following two categories. 
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• Using traditional streaming processing systems: Streaming 
databases like StreamBase |^ and Telegraph p2| , and 
stream processing systems like Storm |28| have been 
used to meet such processing requirements. While they 
do achieve low latencies, they either have limited fault- 
tolerance properties (data lost on machine failure) or lim- 
ited scalability (cannot be run on large clusters). 

• Using traditional batch processing systems: The live data is 
stored reliably in a replicated file system like HDFS 1 1 1 and 
later processed in large batches (minutes to hours) using 
traditional batch processing frameworks like Hadoop 

By design, these systems can process large volumes of data 
on large clusters in a fault-tolerant manner, but they can 
only achieve latencies of minutes at best. Furthermore, the 
processing model is too low level to conveniently express 
complex stream computations. 

B. D-Stream - A programming model for stream processing 

D-Streams execute deterministic computations similar to 
those in MapReduce for fault tolerance, but they do so at a 
much lower latency than previous systems, by keeping state in 
memory. The input data received from various input sources 
(e.g., webservices, sensors, etc) during each interval is stored 
reliably across the cluster to form an input dataset for that 
interval. Once the time interval completes, this dataset is pro- 
cessed via deterministic parallel operations (like map, filter, 
reduce, groupBy, etc) to produce new datasets representing 
program outputs or intermediate states. Finally, these datasets 
can be saved to external source (databases, etc) or aggregate all 
the values into some gradient update / expected loss estimate. 
The advantage of this model is that it provides the developer 
a convenient high-level programming model to easily express 
complex stream computations while allowing the underlying 
system to process the data in small batches thus achieving 
excellent fault-tolerance properties. 

Going into more details, each dataset created in the time inter- 
vals is represented as a Resilient Distributed Dataset (RDD) p4| 
which is an efficient storage abstraction that keeps a distributed 
collection of data in memory (as opposed to writing it to 
the disk) to guarantee fast access. A D-Stream is therefore a 
series of RDDs and lets the user manipulate them collectively 
through various deterministic parallel operations. We illustrate 
this abstraction and a few operators with a small program 
(written for Spark Streaming, an implementation of D-Streams) 
that computes a running count of page view events by URL. 

val pageViews = readStream ( "http ://..." , "Is") 
val ones = pageViews . map (evt => (evt.url, 1)) 
val counts = ones . runningReduce ( (a, b) => a+b) 

This code creates a D-Stream called pageViews by reading 
an event stream over HTTP, and divides the streams into 
batches of 1 -second intervals. It then transforms the event 
stream using the map operator to get a D-Stream of (URL, 1) 
pairs called ones, and performs a running count of these using 
the runningReduce operator. The arguments to map and 
runningReduce are Scala syntax for a closure (function 



literal). 

D-Stream operators: D-Streams provide two types of opera- 
tors to let users build streaming programs: 

• Transformation operators, which produce a new D-Stream 
from one or more parent streams. These can be either 
stateless (i.e., act independently on each interval) or stateful 
(share data across intervals). 

• Output operators, which let the program write data to 
external systems (e.g., save processed data to a database). 

D-Streams support the same stateless transformations available 
in typical batch frameworks, including map, reduce, groupBy, 
and join. In addition, D-Stream also provide stateful operators 
like windowing and moving average operators that share data 
across time intervals. The runningReduce operator in the earlier 
page view program is an example of a stateful operator as it 
combines page views across time intervals. 

Furthermore, since the D-Stream abstraction follow the same 
processing model as batch systems, the two can naturally be 
combined. For example, one may not only join two streams of 
data, but also join a data stream with a batch data - joining a 
stream of incoming tweets against a pre-computed spam filter 
or historical data. 

Fault-tolerance properties: All the intermediate data com- 
puted using D-Stream are by design fault-tolerant, that is, no 
data is lost if any machine fails. This is achieved by treating 
each batch of data as an RDD. Each RDD maintains a lineage 
of operations that was used to create it from the raw input data 
(stored reliably by the system by automatic replication) |34|. 
Hence, in the event of failure, if any partition of an RDD is 
lost, it can be recomputed from raw input data using the lineage. 
As these operations are deterministic and functional by nature 
(i.e. side-effect free), the recomputation can be done using fine- 
grained tasks in parallel on many number. This ensures fast 
recovery minimizing the effect of the failure on the stream 
processing system. This novel technique is called parallel- 
recovery and sets this abstraction apart from existing stream 
processing systems. 

C. Spark Streaming- An implementation of D-Streams 

To implement D-Streams, we use Spark, an existing open- 
source, batch processing framework, to create Spark Streaming. 
Spark is a fast, in-memory batch processing framework built on 
the RDD abstraction, and we naturally extend this framework 
to implement D-Streams. Both these systems are implemented 
in Scala f4l| (a language based on the Java Virtual Machine), 
which allows them integrate well with existing Java and Scala 
libraries for linear algebra, machine learning, etc. Furthermore, 
the compact syntax of the Scala language hides all the complex- 
ities of distribution, replication and data access pattern behind 
an intuitive programming interface. A relevant portion of the 
code of the algorithm is provided in Appendix B. This code 
instantiates a D-Stream with the raw data and derives some other 
D-Streams that correspond to each step of the algorithm. As can 
be seen, this code leverages the functional API of Spark and 
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Scala to express stream transformations in a very natural way. 
Spark Streaming can scale to hundreds of cores while achieving 
latencies as low as hundreds of milliseconds. We use this system 
to implement our traffic estimation algorithms, which we shall 
explain next. 

III. Scalable traffic estimation from streaming 

DATA 

We now explain the relevance of D- Streams with a use case 
of large-scale, low-latency state estimation: vehicular traffic 
estimation. The goal of the traffic estimation algorithm is to 
infer the travel times of each link of an arterial road network, 
given periodic GPS readings from vehicles moving through 
the network. We will describe in this section our estimation 
framework. We will discuss the performance gains obtained by 
using D-Streams in Section |IV] 

We define the road network as a graph V = (V,f), where 
the set £ will be referred to as the "links" of the road network 
(streets) and C as the "nodes" (road intersections). For each 
link / G the algorithm outputs Xl, the time it takes at time 
index t to traverse link I. This time is described as a probability 
distribution parametrized by a vector vi. Our goal is then to 
estimate X^, the joint distribution of all link travel times across 
all links in 8, for each time index t. We assume that the traffic is 
varying slowly enough that it can be considered a steady state 
between each evaluation: our algorithm will consider that all 
the observations between two consecutive time steps have been 
generated according to the same state. To simplify notations, 
we will consider a single time interval and drop the reference to 
time: the joint distribution of travel time is the multidimensional 
variable X. 

We will first give an overview of the GPS data that is 
commercially available today, and an algorithm that converts 
raw GPS points to map-matched trajectories with high accuracy: 
the Path Inference Filter (PIF) |T9| . We will then present our 
modeling approach to infer the traffic conditions from these GPS 
observations. Then we will explain how the Mobile Millennium 
| [20| pipeline implements this algorithm using a computing cloud 
and Streaming Spark as a computation backend. 

A. Map-matching GPS probe data with the Path Inference Filter 

In order to reduce power consumption and transmission costs, 
probe vehicles do not continuously report their location to the 
base station. A high temporal resolution gives access to the 
complete and precise trajectory of the vehicle, but this causes the 
device to consume more power and communication bandwidth. 
Also, such data is not available at large scale today, except in a 
very fragment portion of the the private sector. A low temporal 
resolution carries some uncertainty as to which trajectory was 
followed. In the case of a high temporal resolution (typically, 
a frequency greater than an observation per second), some 
highly successful methods have been developed for continuous 
estimation |fT5|, |[24|, pOj. However, most data collected at 
large scale today is generated by commercial fleet vehicles. 
It is primarily used for tracking the vehicles and usually has 
a low temporal resolution (1 to 2 minutes) |[3|, |[TT|, pT| , 
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Figure 1. Example of observation. The green mark represents an initial GPS 
reading, the orange mark represents a subsequent reading. The black line marks 
the path of the vehicle, as reconstructed by the Path Inference Filter between 
the two GPS points and the numbers are the indexes of each road link covered 
by this observation. Given a realization of the travel time distribution at 
time 1 = 4, all the information on travel times encoded by this observation is 
summarized in the equation above. 



|29|. In the span of a minute, a vehicle in a city can cover 
several blocks (see Figure [T] for an example). Information on the 
precise path followed by the vehicle is lost. Furthermore, due 
to GPS localization errors, recovering the location of a vehicle 
that just sent an observation is a non trivial task: there are 
usually several streets that could be compatible with any given 
GPS observation. Simple deterministic algorithms to reconstruct 
trajectories fail due to misprojection or shortcuts. The Path 
Inference Filter (19] is a probabilistic framework that recovers 
trajectories and road positions from low-frequency probe data 
in real time, and in a computationally efficient manner. 

This algorithm first projects the raw points onto candidate 
projections on the road network and then builds candidate 
trajectories to link these candidate projections. An observation 
model and a driver model are then combined in a Conditional 
Random Field to find the most probable trajectories, using the 
Viterbi algorithm. More precisely, the algorithm performs the 
following steps: 

• We map each point of raw (and possibly noisy) GPS data 
to a collection of nearby candidate projections on the road 
network (Figure 2-1). 

• For each vehicle, we reconstruct the most likely trajectory 
using a Conditional Random Field |19 | (Figure 2-2). 

• Each segment of the trajectory between two GPS points 
is referred as an trajectory measurement (Figure 2-3). A 
trajectory measurement consists in a start time, an end 
time and a route on the road network. This route may 
span multiple road links, and starts and ends at some offset 
within some links. 

At the output of the PIF, we have transformed sequences of GPS 
readings into sequences of trajectory readings. These readings 
are the input for our travel time estimation algorithm. 




Figure 2. Map-matching algorithm: the raw GPS readings a first projected 
onto candidate points on the road network (Step 1). Then all feasible paths 
between each pair of candidate points are computed (Step 2). A dynamic 
programing algorithm then finds the most likely trajectory, using a Conditional 
Random Field (Step 3). Trajectory measurements are the input to the Expectation 
Maximization algorithm. This algorithm outputs distributions of travel times 
(Step 4). 



B. Fundamental generative model 

An example of a trajectory reading is given in Figure [T] 
Estimating the travel time distributions is made difficult by the 
fact that we do not observe travel times for individual links. 
Instead, each reading only specifies the total travel time for an 
entire list of links traveled. We formally describe our estimation 
task as a maximum likelihood estimation problem. 

We consider one reading, described by an offset on a first 
road link Ostart, an offset on a last link Oend, a list of m visited 
links /i • • • a start time, and a travel duration d. We simplify 
the problem by assuming that the partial travel time from the 
start of a road link to some offset o is proportional to the offset: 
the travel time between the start of the link / and offset o is a 
probability distribution Xpartiai (0, o) = j^^i where L (/) is 
the length of link Using this assumption, we can convert the 
description of an observation into a vector form: consider the 
vector a G M.^ where: 



a (Im) 



lAh) 



C>end 
Lilm) 



a{k) = 1 for i G [2...m- 1] 

and a (/) = for all other links I. Thanks to the proportionality 
assumption, the observed travel duration d along the path p is 



Figure 3. Directed (Bayesian) graph of the travel time model. Grey nodes 
are observed variables, white nodes are hidden variables. The arrows represent 
conditional dependencies between the variables. Boxes encode plates, i.e. a 
factorization of repeating variables. (Right) an expansion of a few elements of 
the plates. 



a linear combination of linear travel times: 

d — a (/) xi 
les 

lep 

The vector a is called the path activation vector for this 
reading. Note that fewer than 10 links are covered in a typi- 
cal trajectory measurement, so the path activation vectors are 
extremely sparse. We will use this fact to achieve very good 
scaling of our algorithm. 

For a given time interval, we can completely represent a 
trajectory reading by an observation Y = (a, d) G (M+)"^ xR^. 
Each observation Y'^^^ = (a'^^^ I)*^^^) describes the rth tra- 
jectory's travel time D^^"^ and path a*^^^ as inferred by earlier 
stages of the Mobile Millennium pipeline. The travel time D^^"^ 
is the time interval between consecutive GPS observations and 
is roughly one minute for our source of data. 

To make the inference problem tractable, we model the link 
travel times for each link / as a Gamma distribution with pa- 
rameter vector ui = (ki^Oi), and we assume these distributions 
are pairwise independent^ The independence assumption is 
standard in the transportation literature p7| , p8| and it also 
leads to a highly scalable estimation algorithm. We will discuss 



the validity of this assumption in Section IV-C 



The dependencies between the observations and the parameter 
vector u can be represented as a Bayesian graphical model, 
which encodes all the dependencies between the variables in a 



very compact form (Figure [riI-B| ). We now formalize the prob- 
lem of estimating the set of parameters u = {ui)^ G R^^^ for 
a set of observations a learning problem. We 

consider that the the current estimate of the traffic is completely 

^We experimented with a few standard distributions from the literature 
(Gamma, Normal and Log-normal). Based on our experiments, the Gamma 
distribution fit the data best. Computing the most likely Gamma distribution 
from a set of samples is more expensive than in the case of the Normal, but 
was deemed worthwhile for the added accuracy. 
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described by independent Gamma distributions of the travel 
times over each road hnk. These travel times are (indirectly) 
observed through a set of observations Y^^"^ = (^a^'^\ d^^"^) . 
The set of parameters that maximizes the likelihood of these 
observations is solution to the maximum likelihood problem: 



max ll{Y',iy) = ^logTr (d^"^ 



(1) 




the probability of observing the duration 

^i{a{l))xi when xi is generated according to the 
distribution tt (•; i^^). This likelihood can be decomposed using 
the relations of independence between variables: 




II 7r{Xi;ui)dXi 

II n{Xi',iyi)dXi 

Estimating the travel time distributions is made difficult by the 
fact that we do not observe travel times X for individual links. 
Instead, each observation only specifies the total travel time D 
for an entire list a of links traveled. To get around this problem, 
we use the Expectation Maximization (EM) algorithm p3| , 
|[26j. The EM algorithm operates in two phases: In the E- 
step it considers each travel time measurement and computes a 
distribution over allocations of travel time to each of the links. 
In the M-step it computes the link parameters that maximizes 
the likelihood of the travel times for the allocations made in the 
E-step. By iterating this process the EM algorithm converges 
to a set of link parameters that are a local maximum of the 
likelihood of the data. In our setting, we run the EM algorithm 
in an online fashion: for each time step, we use the previous 
time step as a value, perform a few (iterations and we monitor 
the convergence through the expected complete log-likelihood. 
This form of online EM gives good results for our application 
(Section |IV]). 

While this model is relatively simple, the use of Gamma 
distributions substantially complicates the E step, because 
sums of independent Gamma distributions have no simple 
form. A correct computation of the expected complete log- 
likelihood requires the normalization constant of each distribu- 
tion X^^^l (a*^^^) X*^^^ = D^^\ Computing this normalization 
factor for each observation is fairly expensive (it uses an 
infinite series expansion) and is critical for good accuracy in 
practice. These practical considerations are described in the 
next section, and the computational burden justifies the need 
for cloud computing. 

C Learning with Gamma distributions 

In the previous section, an observation was defined by a 
duration d > and a (sparse) vector a G (1^+) , with the 



Algorithm 1 Sampler for Gamma distributions conditioned on 
a hyperplane 

Given a G (M^)'' and > 0. 



Generate n independent samples 

Zi da~^ 
Then z ~ 



r [ki,d 



Z\a^Z = d 



constramt: 



d = a^x 



(2) 



where x is a vector of unobserved samples from independent 
Gamma distributions. The Expectation-Maximization algorithm 
relies on two operations: computing the conditional distribution 
TT (x|(i, a; i^) of the vector x when a and d are known and the 
parameter vector u is fixed, and sampling this distribution, i.e. 
computing x so that xi ~ T (ki^Oi) under the constraint [5] In 
the case of Gamma distributions, some formulas exist for these 
operations. For the sake of clarity, we only present the main 
results here, without proof or background. The reader is invited 
to refer to Appendix A for a longer exposure. 

Sampling from conditional gamma distributions. Consider 
a set of n independent Gamma distributions Ti ^ T (ki^Oi) with 
k e and G , a n-dimensional vector of positive 

numbers a G (R^)"^ and d > (j^ Call T the joint distribution 
of all TiS. The purpose of this paragraph is to present some 
practical formulas to sample and compute the density function 
of the conditional distribution: 



Z - T 



We define this distribution over the n-dimensional simplex 



Using an appropriate probability measure over the set Sa,d^ the 
probability density function of Z is: 



1 



-1 VEi at 



(3) 



with: 



Oi d ^aiOi 



and /r the density function of the Gamma distribution: 

fr(^x',k,0^ = r(Vj i9-^x^-ie-^~"'^. The normalization 

factor K (^k^O^ can be described by an infinite series, and is a 
straightforward adaptation of a result from Moshopoulos |25|. 
The proof is provided in Appendix A and uses a new distribu- 
tion, called the Gamma-Dirichlet distribution, that generalizes 
the Dirichlet distribution. 

Sampling algorithm. There happens to exist a straightfor- 
ward procedure to sample values from the conditional Gamma 

^The function F (•, •) will refer to the Gamma distribution and the function 
r (•) will refer to the Gamma function. 
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Figure 4. Schematic architecture of the Mobile Millennium system. 

distribution Z, presented in Algorithm [T] A proof that this 
algorithm gives the intended result is also given in Appendix 
A. 

D. Overview of the MM arterial pipeline 

The Mobile Millennium system incorporates a complete 
pipeline for receiving probe data, filtering it, distributing it to 
estimation engines and displaying it, all in in real-time. This 
software stack, written in Java and Scala, evaluates probabilistic 
distribution of travel times over the road links, and uses as input 
the sparse, noisy GPS measurements from probe vehicles. 

The most computation-intensive parts of this pipeline have 
all been ported to a cloud environment. We briefly describe the 
operations of the pipeline, pictured in Figure |4] 

The observations are grouped into time intervals and sent to 
a traffic estimation engine, which runs the learning algorithm 
described in the next section and returns distributions of travel 
times for each link (Figure [2]). 

The travel time distributions are then stored and broadcast 
to clients and to a web interface. Examples of means of travel 
times are shown in Figure |6] 

It is important to point out that Mobile Millennium is intended 
to work at the scale of large metropolitan areas. The road 
network considered in this work is a real road network (a large 
portion of San Francisco downtown and of the greater Bay 
Area, comprising 506,685 road links) and the data is collected 
from the field (as opposed to simulated). A consequence of this 
setting is the scalability requirement for the traffic algorithms 
we employ. Thus, from the outset, our research has focused on 
designing algorithms that could work for large urban areas with 
hundreds of thousands of links and millions of observations. 

Our algorithm needs to run expensive computations in an 
iterative fashion on incoming streaming data. As such, it is 
a perfect candidate for D- Streams. We now describe how we 
parallelized the EM algorithm using Spark's implementation of 
D-Streams. 

Figure [5] shows the data flow in the algorithm in more 
detail. In the E-step, we generate per-link travel time samples 
from whole observations; specifically, for each observation 
y(^) = {a^^\ d^'^^) , we produce a set of U weighted samples 
yj^r) _ |(^^(^.^)^'^(^.^)^ each sample x^^'^^ produced 



k 



► xw x(^) 




Figure 5. System workflow of the EM algorithm. In the E-step, we generate 
per-link travel time samples from whole observations; specifically, for each 
observation Y^^^ = (a^^\d^^^), we produce a set of U weighted samples 
X(^) — { (x^^'^), lu^^'^)) } each sample x^^'^^ produced by ran- 

domly dividing travel time d^'^ among its constituent links (producing a travel 
time x[^'^^ for each edge k G a^^^). We assign a weight w^^'^'> as the 
likelihood of travel time a;^^'^^ according to the current distribution parameters 
v. In the shuffle step, we regroup the samples X^^^ by link, so that each link 
/ now has samples X; = | w^^'^^^ | from all the observations r 

that go over it. In the M-step, we recompute the parameters ui to fit link Vs 
travel time distribution to the samples X^. 



by randomly dividing travel time d'^^^ among its constituent links 
(producing a travel time for each edge k G a*^^^). We 

assign a weight w^^^'^^ as the likelihood of travel time x*^^'^^ 
according to the current distribution parameters z/. In the shuffle 
step, we regroup the samples X*^^^ by link, so that each link 
/ now has samples X/ = | (^^[^'^^ | from all the 

observations r that go over it. In the M-step, we recompute 
the parameters vi to fit link Vs travel time distribution to the 
samples X/. 

We note that our EM algorithm is representative of a large 
class of iterative machine learning algorithms, including clus- 
tering, classification, and regression methods, for which popular 
cloud programming frameworks like Hadoop and Dryad are 
often inefficient (T6j, |[23l. Our lessons with Streaming Spark 
are likely applicable to these applications too. 



IV. A CASE study: taxis in the San Francisco Bay 
Area 

Having now described an algorithm for computing travel time 
distributions in real time on a road network, we describe our 
validation experiments. These experiments explore two settings: 

• The raw performance of the machine learning algorithm, 
given a limited amount of data and a computational budget, 

• The performance of the Streaming Spark framework in 
distributing computations across a cluster, and the com- 
putational performance improvement gained by additional 
hardware. 

The performance of the algorithm is computed by asking the 
model to give travel time distributions on unseen trajectories, 
slightly in the future. The observed travel time of the trajectory 
is then compared with the distribution provided by the model. 
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Mean travel time for a few links 




Figure 6. An example output of traffic estimates from our algorithm: mean 
travel times on different road links in the business district of San Francisco 
during a complete day. 

We measured the LI and L2 losses between the observed 
travel time and the distribution mean, and the likelihood of the 
observed travel time with respect to the predicted travel time 
distribution. This is done with different amount of data and 
different time horizons. 

The computational efficiency of the algorithm is validated 
in two steps. First, we demonstrate that our algorithm scales 
well: given twice as many computation nodes, it perform the 
same task about twice as fast. We also see that this algorithm 
is bounded by computations. Then, we demonstrate that it 
can sustain massive data flow rates under strict scheduling 
constraints: we fix a completion time of a few seconds for each 
time step, and we find the maximum flow rate under a given 
computational budget. 

A. Taxis in San Francisco 

Our implementation was run on a road network that corre- 
sponds to the greater San Francisco Bay Area (506,685 road 
links), using some taxi data provided by the Cabspotting project 
| [TT| . This dataset contains GPS samples of a few thousand 
taxicabs emitted every minute, for more than a year. All in all, 
it represents hundred of millions of GPS points. We ran our 
algorithm on a typical day (August 12th 2010, a Tuesday) with 
different settings. An example of input data is given in Figure |7] 
A typical output of travel times provided by the algorithm is 
given in Figure [6] 

B. Good scalability results using a large cluster 

In this section, we evaluate how much the cloud implemen- 
tation helped with scaling the Mobile Millennium EM traffic 
estimation algorithm. Distributing the computation across ma- 
chines provides a twofold advantage: each machine can perform 
computations in parallel, and the overall amount of memory 
available is much greater. 

Scaling. First, we evaluated how the runtime performance 
of the EM job could improve with an increasing number of 



1 
\ 




Figure 7. An example of dataset available to Mobile Millennium and processed 
by the path inference filter: taxicabs in San Francisco from the Cabspotting 
program. Large circles in red show the position of the taxis at a given time 
and small dots (in black) show past positions (during the last five hours) of the 
fleet. The position of each vehicle is observed every minute. 



nodes/cores. The job was to learn some historical traffic estimate 
for San Francisco downtown for a half-hour time-slice, using 
a large portion of the data split in one-hour intervals. This 
data included 259,215 observed trajectories, and the network 
consisted of 15,398 road links. We ran the experiment on two 
cloud platforms: the first was using Amazon EC2 ml . large 
instances with 2 cores per node, and the second was a cloud 
managed by the National Energy Research Scientific Computing 
Center (NERSC) with 4 cores per node. Figure [8] (bottom) 
shows near-linear scaling on EC2 until 80-160 cores. Figure [8] 
(top) shows near-linear scaling for all the NERSC experiments. 
The limiting factor for EC2 seems to have been network 
performance. In particular, some tasks were lost due to repeated 
connection timeouts. 

Scaling on Streaming Spark. After having found the bot- 
tlenecks in the Spark program, we wrote another version in 
Streaming Spark. The two programs are strikingly similar (see 
program listing in Appendix B). We then benchmarked the 
application. We ported this application to Spark Streaming using 
an online version of the EM algorithm that merges in new data 
every five seconds. The implementation was about 200 lines of 
Spark Streaming code, and wrapped the existing map and reduce 
functions in the offline program. In addition, we found that only 
using the real-time data could cause overfitting, because the data 
received in five seconds is so sparse. We took advantage of 
D- Streams to also combine this data with historical data from 
the same time during the past 10 days to resolve this problem. 
Figure |9] shows the performance of the algorithm on up to 80 
quad-core EC2 nodes. The algorithm scales almost perfectly, 
largely because it is so CPU-bound, and provides answers an 
order of magnitude faster than the previous implementation. 
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Figure 8. Experiments with Spark to build historical estimates of traffic, on 
NERSC (top) and Amazon EC2 (bottom). 
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Figure 9. Experiments with Streaming Spark: rate of observation processing 
for different cluster sizes. 



C Our algorithm can he adjusted for trade-offs between 
amount of data, computational resources and quality of the 
output 

We now study the accuracy of our algorithm in estimating the 
traffic. Even if we receive a large number of observations per 
day, this number is not sufficient to cover properly in real time 
all the road network: indeed, some sections of the road network 
are much less traveled than the busy downtown areas. We use 
several strategies to mitigate this spatial discrepancy: 

• We use a prior on the Gamma distribution, itself a Gamma 
distribution since the Gamma is in the exponential family 
and conjugate with itself. The parameters of this prior are 
to 70% of the speed limit in mean and 1 minute or 50% of 
the travel time in standard deviation, whichever is greater. 

• We incorporate some data from the same day before 
the current time step, weighted by an exponential decay 
scheme: the traffic in the arterial network is assumed to 
change slowly enough. 

• We also incorporate some data from previous days, corre- 



sponding to the same day of the week (Monday, Tuesday, 
etc.). Traffic is expected to follow a weekly pattern during 
the same month. 
To sunmiarize, a large number of observations are lumped 
together and weighted according to the formula: 

The half-time decaying factors Atday and At^eek are set so that 
the corresponding weight is 0.2 at the end of the window. 

Since the EM learning algorithm is not linear in the obser- 
vations, we cannot reduce each observation to some sufficient 
statistics. As the algorithm moves forward in time, each observa- 
tion will appear at different time steps with a different weight 
and needs to be reprocessed. This is a significant limitation 
from this approach, but it makes for a good testing ground of 
Streaming Spark. 

Our EM algorithm can be adjusted in several ways: 

• The number of weeks of data to look back (between 1 and 
10) 

• The time window to consider before the current observation 
(between 20 minutes and 2 hours) 

• The number of samples generated during the E-steps (10- 
100) 

• The number of EM iterations (1-5) 

• The duration of each time step (5 seconds- 15 minutes) 
The observations we process all have a duration of one minute, 
but travel times experienced by users are usually much longer 
(10 minutes to a few hours). As such, a good metric for assessing 
the quality of a model should not be on predicting travel times 
for one-minute observations, but on longer distances. Hour-long 
travels are very likely to go be spent mostly on highways, which 
is not the scope of this study, and taxicabs usually make small 
trips (10 to 30 minutes). This is why we focus our attention 
to travel times between one minute (the observations) and 30 
minutes (typical durations for taxi rides). As far as we know, 
this study of different durations is seldom done in the study of 
traffic, which limits any attempt to compare the performance 
between different algorithms. 

The longer trajectories are obtained from the Path Inference 
Filter. They are then cuts into different pieces of the same length 
(1 minute, 5 minutes, 10 minutes, 20 minutes). Each piece of 
trajectory is considered as an independent piece of trajectory 
for the purpose of travel time prediction. 

We ran the algorithm with 4 different settings: 

• SlidingBig: the most expensive setting (10 weeks of data, 2 
hours of data, 100 EM samples, 5 EM iterations, 15 time 
steps), used as the baseline for comparison. Travel time 
estimates are produced every 20 minutes, 

• SlidingBig 1: uses less data (40 minutes of data), 

• SlidingBig2: uses less data (10 days), 

• SlidingBig3: uses the same amount of data, but performs 
only a single EM iteration every 4 minutes instead of 5 
EM iterations every 20 minutes, 

• SlidingBig4: uses the same amount of data, but generates 
only 10 EM samples for each observation 

For all these experiments, the prior was fixed. 
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Figure 10. LI residuals for different settings and for different travel times. 
The dashed lines indicate the 95% confidence interval. On the x axis, the travel 
time of the the observations considered for this metric. 



We now compare the results obtained with the different 
experiments. We first turn our attention to the LI loss in 
Figure 
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As expected, the best performance is obtained for 
experiment SlidingBig, which uses the most data. Interestingly 
enough, the best performance is obtained for travels of medium 
length (4-11 minutes), and not for short trajectories. This can 
be explained by the conversion step that transforms trajectory 
readings on partial links into weighted observations on complete 
links. The relation between link travel time and location on a 
link is more complex than a linear weighting. Nevertheless, 
the model gives relatively good performance by this simple 
transform. When a vehicle is stopped at a red light, it does not 
travel along the link but still has a non-zero travel time. In this 
case, the weight of an observation is taken to be half of the total 
travel time of the link. In particular, the relative error increases 
as the duration (and the length) of travels increases. Performance 
is not too different between experiments, which suggests some 
even smaller amount of data could be considered. 

The results for the L2 loss, presented in Figure [TT] provide 
some similar, if more acute, results. The RMSE is lowest for 
small to medium travels (in the range of 3-10 minutes). 

A probabilistic metric (the log-likelihood) gives a different 
insight, as shown in Figure [12] The model best explains the data 



for very short travel times (similar to what is was trained on) but 
its precision falls down as the length of trajectories increases. 
All in all, this results should not be unexpected: this model 
with independent links cannot take into account the correlations 
that occur due to light synchronizations or drivers' behavior. As 
such, the probability density of a longer travel rapidly dilutes as 
the number of links increases As we saw with our study of LI 
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Figure 11. L2 residuals for different settings and for different travel times. 
The dashed lines indicate the 95% confidence interval. 
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Figure 12. Log-likelihood of unobserved trajectories, for different trajectory 
lengths and different settings. 



and L2 errors, the mean travel time becomes the only significant 
value of interest for longer travel times. In the light of this result, 
there seems to be little to gain by modeling travel time with 
physically realistic, link-based, independent distributions, as the 
independence assumption will strongly weigh on the quality 
of the travel time for longer travels. Instead, we recommend 
focusing effort on simpler models of travel times that take into 
account the correlations between links. 

Conclusion 

As datasets grow in size, some new strategies are required 
to perform meaningful computations in a short amount of time. 
We explored using a new technique, Discretized Streams or D- 
Streams, that offers some significant advantages for implement- 
ing large-scale state estimation in near-real- time. D-Streams 
were implemented in the Spark computing framework. This 
approach was validated with a real-life, large-scale estimation 
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problem: vehicular traffic estimation. Our traffic algorithm is 
an Expectation-Maximization algorithm that computes travel 
time distributions of traffic by incremental online updates. This 
algorithm seems to compare favorably with the state of the 
art and shows some attractive features from an implementation 
perspective. When distributed on a cluster, this algorithm scales 
to very large road networks (half a million road links, tens of 
thousands of observations per second) and can update traffic 
state in a few seconds. 

In order to foster research in systems and in traffic, the authors 
have released the code of Spark Streaming the code of 
the EM traffic algorithm |7|, and the dataset used for these 
experiments I^TJ. 
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Appendix A: The Gamma-Dirichlet distribution 

We provide proofs and more detailed exposure to the claims 
made in Section [ni-B[ We formally introduce a straightforward 
generalization of the Dirichlet distribution, which we call the 
Gamma-Dirichlet distributiot^ We detail the derivation of the 
p.d.f. and the sampling procedures of the conditional distribution 
Z\a^Z = t 

The Gamma-Dirichlet distribution. The regular simplex 
^ ^YiQ convex hull of the elementary vertices 

n]' Given a vector k G and a vector 6 G 

we define the Gamma-Dirichlet distribution: 

X ^VD{k,e) 

^Even if this extension is quite simple, to our knowledge, the following results 
have not been presented in the literature so far. 
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with values over the regular simplex S" as the normalized sum 
of n elements drawn from independent Gamma distributions: 



with 



(4) 



and Yi all pairwise independent. The Gamma-Dirichlet distri- 
bution is a simple generalization of the Dirichlet distribution: if 
= al for some a > 0, this is the Dirichlet distribution of the 
nth order. The definition gives a straightforward procedure to 
sample some values from X. We now present some new results: 
we give a formula for the density function that is amenable for 
computations and we describe some heuristics that speed up 
computations by some significant margin with no significant 
loss of precision in practice. Finally, we study the relation 
between the Gamma-Dirichlet distribution and the conditional 
distribution Z\a^ Z = d, in which a vector of independent 
Gamma distributions is jointly constrained on a hyperplane. 
Specifically, we show this distribution over Z is equivalent to 
the Gamma-Dirichlet distribution. 

Density. Note first that one needs to be careful in defin- 
ing the underlying a-algebra of our probability space, as the 
values of X are located in an embedding of of mea- 
sure (a hyperplane). Consider the n-dimensional hyperplane 
T-L^ = = l}. This hyperplane includes the simplex S'^. 

The Lebesgue measure of this set in is zero. However, 
we can consider the Lebesgue measure /i defined over R^~^ 
and the transform: (j) : W^~^ defined by (ii) = 

(^u^l — Uj^ . This transform is a mapping, so it lets us 
define a new measure fi for the space based on jl. Under 
this measure, the measure of the simplex S'^ is positive. Call 
/i the measure defined over 5^ by /i (•) = /i A (')• With 

respect to this measure, the Gamma-Dirichlet distribution has 
density function: 

n 

f{x) = K{k,0)-'l[fr{x,',k,,0,) 

with fr {x;k^O) the p.d.f. of the gamma distribution: 
/r (x; k^O) = r {k)~^ 0~^x^~^e~^ ^. The normalization fac- 
tor n{k^O) is defined by an infinite series, based on a result of 
Moschopoulos pSj: 



/ n \ oo r- 



with 



k — ^ ^ ki 



and a series defined by the recursive formula: 



^0= 1 



Proof: This result can be obtained by first proving the 
equivalence of the Gamma-Dirichlet distribution with the condi- 
tional distribution Z\ = 1, which is done below and does 
not need the scaling constant n. Now, consider the distribution 
of the normalization factor Y = Yj . This variable is a 
sum of independent Gamma variables, with p.d.f /y. Then 
K, (/c, 0) = /y (1.0). An expression for this coefficient in terms 
of a converging series of Gamma coefficients is given in |8|, 
|25|. ■ 
The computation of Si is quadratic in time, which can be 
too slow if tens of thousands of coefficients are required before 
convergence, as it may happen in our application. We present 
below some heuristics to speed up the computations of this 
sequence. 

Equivalence with conditional Gammas. Consider the con- 
ditional distribution Zi = Yi \ Yj = 1 with Yi defined in 
Equation [4] The density function of this distribution is over 
nearly everywhere, however it has non-zero measure over 
the regular simplex S'^. Given a measure /i defined over 5^, 
the two distributions are the same: 

Vxg5-, frD{x)=f{x) 

Proof: This proof is adapted from a similar proof f\4] for 
the Dirichlet distribution. Using the same notations as above, 
define Y = Yj and Xi = Yi/Y for i < n — 1. The joint 
density for the F's is: 



by: y = Y^i^i^Vi and 
y ^0~^yi for i < n — 1. This mapping is invertible and 
its Jacobian at y is - 0~^) y. Thus the joint density of x) 



Define the transform (p : 



is: 



/ n-1 \^ri^ 

V i=l ) i=l 

This shows that the variables y and x are independent, and that 
the distribution of x is a Dirichlet distribution. 

Define the transform : y = ^iyi and Xi = y~^yi for 
k < n — 1. This mapping is also invertible, with Jacobin y^ . 
The joint density of {y^x) is: 



g{y,x) (x\l-^Xi\ W 

By identification, we get: g {x\y = 1) = g[A~^x) with 
A the diagonal matrix defined by A^^ = Oi. Since 
(f~'^ ^(A~^x, l)^^ = the result ensues. ■ 
Sampling from conditional gamma distributions. Consider 
a set of n independent Gamma distributions Ti ~ F (/c^, 6>^), a n- 
dimensional vector of positive numbers a G (R+) and t > 0. 
The purpose of this section is to present some practical formulas 
to sample and compute the density function of the conditional 
distribution: 



t 



We define this distribution over the n-dimensional simplex 



T 

a X 



As before, we define a new measure over the hyperplane defined 
by a^x = t, and use it as our base measure dz for Z. We call 
/ the probability density function of variable Z with respect 
to this measure. With respect to this measure, the probability 
density function of Z is that of a Gamma-Dirichlet distribution: 



1 



with: 



n- 



Proof: Call h the scaling transform x 



defined by y = h{x) : i/i = ait~^Xi. Then all points from 
Sa,t are mapped into the n-regular simplex Since this 

transform is a linear scaling, one only needs to find the volume 
of Sa,t to define the new probability distribution. Call Sa,t the 
n-standard simplex defined by the origin coordinate and by its 



Its volume is 



vertices (t ^.^.,.^r, , 
The volume of the hypersurface Sa,t can be found by differen- 
tiating the scaled volume sSa,t along the normal axis 
for s = 1: 



ds 
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ai 



IS 



we get 



=1 (n-l)! K 

Since the volume of the n-regular simplex S'^~^ 
our result. ■ 
Because of this equivalence between Gamma-Dirichlet distri- 
bution and independent Gamma distributions conditioned on a 
hyperplane, we can use the straightforward sampling algorithm 
from the definition of the Gamma-Dirichlet distribution to 
sample values from the conditioned Gamma distribution. This 
algorithm is presented in Algorithm [T] 



// Shuffling 

// First double is tt, second double is weight 
val sainples_by_node = { 
em_c omp 1 e t e_ s amp 1 e s 

.flatMap({ case (cs, w) => 
(cs .mask. nodes zip cs .partial) .map({ 
case (node, partial_tt) => 
(node, (partial_tt, w)) 

}) 

}) 

. groupByKey 

} 

val new_node_states = samples_by_node .map({ 
case (nid, values) => 
val tts = values .map (_._1) . toArray 
val ws = values .map (_._2) . toArray 
val new_state = GammaLearn.maxLikelihood(tts , vv 
(nid, new_state) 

}) 

new_node_states . foreachRDD(state_rdd => { 
// This code will be called on the master node 
// every time a new RDD is received, 
val new_state = state_rdd. collect . toMap 
printlnC Computed new state: "+new_state) 

}) 



Appendix B: code snippet for the Spark program and 
THE Streaming Spark program 

Main program, written in Scala using Spark Streaming. 

// Creates the spark context 

val ssc = new StreamingContext ( . . . ) 

// Build the stream 

val data_stream = createDStreamFromConfig(ssc , config) 
// Asynchronous stream computations start here. 
// Create samples 

val em_complete_samples = data_stream. flatMap({ 
case (p_ob, w) => 
val state = DistributedMutableStorage . state 
val complete_samples = GammaSample . getSamples( 
P-ob, 
state , 

num_em_samples) 
complete_samples .map((_, w)) 

}) 



